Hybrid Indexes for Repetitive Datasets
Authors
Abstract
Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show that this also significantly reduces query times.
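The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is restricted to exact matching (edit distance 0) and is not the authors' implementation. All names (`lz77_parse`, `build_filtered`, `hybrid_search`, `SEP`) are hypothetical, the parse is a naive greedy LZ77 variant, and `str.find` stands in for the conventional index on the filtered text. The assumption is that only characters within `max_len` positions of a phrase boundary need to be kept, because any match not lying wholly inside a copied phrase must touch a boundary.

```python
from typing import List, Optional, Tuple

# (start, length, source); source is None for a literal character.
Phrase = Tuple[int, int, Optional[int]]
SEP = "\x00"  # assumed never to occur in the text or in a pattern


def lz77_parse(text: str) -> List[Phrase]:
    """Naive greedy LZ77-style parse (quadratic, for illustration only)."""
    phrases: List[Phrase] = []
    i, n = 0, len(text)
    while i < n:
        length, source = 0, None
        # Extend while text[i:i+length+1] also occurs starting before i.
        while i + length < n:
            j = text.find(text[i:i + length + 1], 0, i + length)
            if j == -1:
                break
            length, source = length + 1, j
        if length == 0:
            phrases.append((i, 1, None))  # first occurrence of a character
            i += 1
        else:
            phrases.append((i, length, source))
            i += length
    return phrases


def build_filtered(text: str, phrases: List[Phrase], max_len: int):
    """Keep characters near phrase boundaries; collapse the rest to SEP."""
    n = len(text)
    keep = [False] * n
    for start, _, _ in phrases:
        for p in range(max(0, start - max_len), min(n, start + max_len)):
            keep[p] = True
    chars: List[str] = []
    positions: List[int] = []  # filtered index -> original index, -1 for SEP
    for i, ch in enumerate(text):
        if keep[i]:
            chars.append(ch)
            positions.append(i)
        elif not chars or chars[-1] != SEP:
            chars.append(SEP)
            positions.append(-1)
    return "".join(chars), positions


def hybrid_search(text: str, pattern: str, max_len: int) -> List[int]:
    """Return all start positions of pattern in text (exact matching)."""
    m = len(pattern)
    if not 0 < m <= max_len:
        raise ValueError("pattern length must be between 1 and max_len")
    # A real index would build the parse and filtered text once, not per query.
    phrases = lz77_parse(text)
    filtered, positions = build_filtered(text, phrases, max_len)

    # Step 1: matches in the filtered text (str.find replaces a real index).
    found = set()
    idx = filtered.find(pattern)
    while idx != -1:
        found.add(positions[idx])
        idx = filtered.find(pattern, idx + 1)

    # Step 2: propagate matches through copied phrases until nothing is new.
    copies = [(s, l, src) for s, l, src in phrases if src is not None]
    stack = list(found)
    while stack:
        p = stack.pop()
        for s, l, src in copies:
            if src <= p and p + m <= src + l:  # match lies inside the source
                q = p - src + s
                if q not in found:
                    found.add(q)
                    stack.append(q)
    return sorted(found)
```

Calling `hybrid_search(text, pattern, max_len)` with a pattern no longer than `max_len` should return the same positions as a direct scan of the original text. The sketch scans every copied phrase for each match, which is simple but slow, and it does not cover approximate matching with a bound on edit distance.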
Similar resources
CHICO: A Compressed Hybrid Index for Repetitive Collections
Indexing text collections to support pattern matching queries is a fundamental problem in computer science. New challenges keep arising as databases grow, and for repetitive collections, compressed indexes become relevant. To successfully exploit the regularities of repetitive collections different approaches have been proposed. Some of these are Compressed Suffix Array, Lempel-Ziv, and Grammar...
A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number of instances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improve the classification performance of microarray datasets by selecting the significant features. Combining the concepts of rough sets, weighted rough set, fuzzy rough se...
Hybrid Indexes to Expedite Spatial-Visual Search
Due to the growth of geo-tagged images, recent web and mobile applications provide search capabilities for images that are similar to a given query image and simultaneously within a given geographical area. In this paper, we focus on designing index structures to expedite these spatial-visual searches. We start by baseline indexes that are straightforward extensions of the current popular spati...
CHIC: a short read aligner for pan-genomic references
Recently the topic of computational pan-genomics has gained increasing attention, and particularly the problem of moving from a single-reference paradigm to a pan-genomic one. Perhaps the simplest way to represent a pan-genome is to represent it as a set of sequences. While indexing highly repetitive collections has been intensively studied in the computer science community, the research has fo...
The effect of combining low frequency repetitive trans-cranial magnetic stimulation and conventional rehabilitation in improving functional behavior of hemiplegic patients
Purpose: Some new methods of treatment focus on using magnetic stimulation as a means of induction currents in the brain to produce therapeutic effects. The aim of this clinical trial was to determine the effects of repetitive transcranial magnetic stimulation (rTMS) plus routine rehabilitation on hand grip and wrist motor function in hemiplegic patients. Materials and Methods: Twelve hemiplegic...
Journal title:
- Philosophical transactions. Series A, Mathematical, physical, and engineering sciences
Volume 372, Issue 2016
Pages -
Publication date 2014